What Character.ai Announced
Squinch: 6‑Bit Gradient Compression
Squinch is a blockwise 6‑bit gradient compression algorithm invented by Noam Shazeer to cut inter‑node communication cost without hurting accuracy versus bfloat16 training. It targets transformer gradient distributions specifically, which tend to be well‑regularized and amenable to aggressive quantization.
Key properties:
- Each block contains 8 gradient values and is compressed into 48 bits, encoding both sign and magnitude.
- The maximum absolute value in a block is mapped to an 8‑bit q_max using a log transform, which defines a shared dynamic range for that block.
- Individual elements are quantized via a square‑root mapping into 4‑bit q_elems[i], preserving relative differences while minimizing log‑error.
Formally (simplified):
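The post's exact formulas aren't reproduced here, so the NumPy sketch below shows one plausible reading of the scheme under stated assumptions: an 8‑bit log‑quantized block maximum plus, per element, one sign bit and a 4‑bit square‑root‑mapped magnitude (8 + 8 × 5 = 48 bits). The constants LOG_MIN and LOG_MAX and the exact mappings are illustrative, not Character.ai's.

```python
import numpy as np

BLOCK = 8          # gradient values per block (from the article)
# Hypothetical log-scale range for the block maximum; the real constants are not public.
LOG_MIN, LOG_MAX = -40.0, 10.0

def encode_block(g):
    """Compress one block of 8 gradients into (q_max, signs, q_elems) ~ 48 bits."""
    g = np.asarray(g, dtype=np.float32)
    amax = float(np.abs(g).max())
    if amax == 0.0:
        return 0, np.zeros(BLOCK, np.uint8), np.zeros(BLOCK, np.uint8)
    # Shared dynamic range: quantize log(max) into 8 bits.
    t = (np.log(amax) - LOG_MIN) / (LOG_MAX - LOG_MIN)
    q_max = int(np.clip(round(t * 255), 0, 255))
    dec_max = np.exp(LOG_MIN + (q_max / 255) * (LOG_MAX - LOG_MIN))
    # Per element: sign bit plus a square-root-mapped 4-bit magnitude.
    signs = (g < 0).astype(np.uint8)
    rel = np.sqrt(np.minimum(np.abs(g) / dec_max, 1.0))
    q_elems = np.clip(np.round(rel * 15), 0, 15).astype(np.uint8)
    return q_max, signs, q_elems

def decode_block(q_max, signs, q_elems):
    """Reconstruct approximate gradients from the compressed block."""
    dec_max = np.exp(LOG_MIN + (q_max / 255) * (LOG_MAX - LOG_MIN))
    mag = (q_elems / 15.0) ** 2 * dec_max   # invert the square-root mapping
    return np.where(signs == 1, -mag, mag).astype(np.float32)
```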
Attention Z‑Reg: Keeping Logits in the Sweet Spot
Attention Z‑Reg is a regularization technique applied to attention and linear logits to keep their log‑sum‑exp (“Z” value) near zero, maximizing the effective precision of bfloat16 during training. As logit magnitudes grow, bfloat16 spacing between representable numbers increases, which can degrade gradient quality and stability.
Core idea:
- Define a virtual regularization term on the attention logits’ log‑sum‑exp Z that penalizes it for drifting away from zero (sketched in code below).
- Instead of adding this term to the training loss, its gradient is injected directly into the attention backward pass, steering logits toward a numerically safe range.
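A minimal PyTorch sketch of the backward‑injection idea, assuming the virtual term takes the common z‑loss form coef · Z² with Z = logsumexp(logits); the class name, the coefficient, and the custom‑autograd packaging are illustrative, not Character.ai's kernels.

```python
import torch

Z_REG_COEF = 1e-4  # placeholder coefficient, not a value from the article

class AttentionZReg(torch.autograd.Function):
    """Identity in forward; adds the gradient of coef * Z^2 in backward.

    The term never appears in the reported loss; only its gradient is
    injected, nudging the attention logits toward Z ~ 0 where bfloat16
    spacing is finest.
    """

    @staticmethod
    def forward(ctx, logits):
        ctx.save_for_backward(logits)
        return logits.view_as(logits)  # pass logits through unchanged

    @staticmethod
    def backward(ctx, grad_out):
        (logits,) = ctx.saved_tensors
        z = torch.logsumexp(logits, dim=-1, keepdim=True)
        # d(coef * Z^2)/d(logits) = 2 * coef * Z * softmax(logits)
        z_grad = 2.0 * Z_REG_COEF * z * torch.softmax(logits, dim=-1)
        return grad_out + z_grad

# Usage inside an attention module, right before the softmax:
#   logits = AttentionZReg.apply(logits)
```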
Dynamic Clamping: Quantization‑Aware Stability
Dynamic clamping addresses a subtle failure mode in quantization‑aware training: tiny activation ranges collapsing to all zeros after quantization. This is especially relevant in FFNs with activations like ReLU², where scaled weights can cause intermediate tensors to occupy extremely narrow value bands.
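The post doesn't spell out the mechanism, so here is a minimal sketch of one plausible form: derive the clamp used for symmetric int8 fake‑quantization from the tensor's own observed range instead of a fixed constant. The function names, the floor, and the headroom knob are assumptions for illustration.

```python
import numpy as np

def fake_quant_int8(x, clamp_value):
    """Symmetric int8 fake-quantization over [-clamp_value, clamp_value]."""
    scale = clamp_value / 127.0
    return np.clip(np.round(x / scale), -127, 127) * scale

def dynamic_clamp(x, floor=1e-6, headroom=1.0):
    """Derive the clamp from the tensor's own range each step.

    With a static clamp sized for 'typical' activations, a very narrow band
    (e.g. squashed ReLU² outputs) rounds entirely to zero; tying the clamp to
    the observed maximum keeps such tensors on a usable grid. `floor` and
    `headroom` are illustrative knobs, not values from the article.
    """
    return max(float(np.abs(x).max()) * headroom, floor)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8)) * 1e-4            # extremely narrow value band
static = fake_quant_int8(x, clamp_value=8.0)      # collapses to all zeros
dynamic = fake_quant_int8(x, dynamic_clamp(x))    # values survive quantization
print(np.abs(static).max(), np.abs(dynamic).max())
```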
Visibility Mask: Smarter Attention Batching
Visibility mask is a compact attention API that replaces large sparse boolean masks with two integer tensors per token: visibility_start and visibility_limit. These encode which positions each token can attend to during training and inference.
Mechanics:
- Shape: both tensors hold one integer per token, i.e. the same shape as the token sequence.
For each token (expanded into a dense mask in the sketch below):
- Positions with index < visibility_start cannot attend to this token.
- Positions with index ≥ visibility_limit cannot attend to this token.
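A small NumPy sketch that expands the two integer tensors into a dense [query, key] mask, purely to make the rule above concrete; a real attention kernel would consume visibility_start and visibility_limit directly and never materialize this matrix. The function name is illustrative.

```python
import numpy as np

def expand_visibility(visibility_start, visibility_limit):
    """Dense boolean mask: entry [q, k] is True iff query q may attend to key k.

    Per the rule above, key position k is visible to query positions q with
    visibility_start[k] <= q < visibility_limit[k].
    """
    start = np.asarray(visibility_start)
    limit = np.asarray(visibility_limit)
    q = np.arange(len(start))[:, None]               # query positions (rows)
    return (q >= start[None, :]) & (q < limit[None, :])

# Causal attention over a single 5-token document [A B C D E]:
# token i is visible to queries i..4, so start[i] = i and limit[i] = 5.
causal = expand_visibility([0, 1, 2, 3, 4], [5, 5, 5, 5, 5])
assert causal[3, 2] and not causal[2, 3]   # D attends to C; C cannot attend to D
```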
This representation supports:
- Causal attention for a single document, where tokens can only see past positions.
- Multiple independent documents packed into one sequence, with disjoint visibility ranges.
- Tree‑structured documents, where parents and children have carefully designed mutual visibility.
- Beam search with empty slots in paged attention, while still leveraging efficient packed batches.
- Bidirectional prefixes followed by causal tokens, common in chat and instruction tuning.
Examples from the article show how different visibility_start and visibility_limit arrays encode causal, multi‑doc, tree, and beam‑search scenarios over simple token sequences like [A B C D E] or [A AA AAA B BB BBB C CC CCC].
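For instance, the packed multi‑document case over [A AA AAA B BB BBB C CC CCC] could be encoded as below, reusing the expand_visibility sketch from above; this is one plausible encoding, and the article's exact arrays may differ.

```python
# Three causal documents packed into one sequence, mutually invisible:
start = [0, 1, 2, 3, 4, 5, 6, 7, 8]      # causal within each document
limit = [3, 3, 3, 6, 6, 6, 9, 9, 9]      # visibility ends at the document boundary
packed = expand_visibility(start, limit)
assert packed[2, 0] and not packed[3, 2]  # AAA sees A; B cannot see AAA
```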